OpenML <> Scikit-learn Hackathon
Logistics
Location
Montparnasse Tower, 33 Avenue du Maine, West Entrance (on your left when leaving the train station), 27th Floor, ring at “Probabl”
Contact Persons
+33783822597 (François Goupil)
+33760407677 (Charlène Bizollon)
Communication Channel
Please join the OpenML Slack workspace and the dedicated hackathon channel for easy communication and updates.
Slack: https://join.slack.com/t/openml/shared_invite/zt-2ktk2cj1c-r637o20pfCc0H7PS8OUGtA
Channel: conference
Wifi
SSID: :probabl.guest
PWD: :probabl.
Note: At some point, be ready to use your own mobile data (we are experiencing some difficulties).
SSID: :probabl.eiffel-2.4
PWD: :probabl.
Note: low bandwidth
Schedule
June 24 - 09:00-18:00
09:00-09:30 | Welcome | Coffee and Croissants |
09:30-10:30 | Introduction | - Short presentation of the scikit-learn and OpenML projects + Probabl (Joaquin and Pieter for OpenML, Guillaume Lemaitre for scikit-learn, Yann Lechelle for Probabl) - Quick round where everyone introduces themselves. - Plan break-out sessions or suggest new ones. |
10:30-11:30 | Breakout | (1) Organizing community events and Onboarding Contributors (Maren) |
11:30-13:00 | Lunch | Bouillon Chartier |
13:00-14:00 | Breakout | (2) Governance, Funding and Sponsorship (Adrin + François) |
14:00-15:00 | Breakout | (3) Future Collaboration between scikit-learn and OpenML (Guillaume) |
15:00-18:00 | Code | Code: explore each other’s projects. |
After the official programme each day, there is a suggested bar and restaurant to go to:
Bar + Restaurant: Le Falstaff
June 25 - 09:00-18:00
09:00-10:00 | Coffee/croissants | Croissant Talk by Joaquin |
10:00-11:00 | Breakout | (6) Collaboration Ecosystem for Open-Source Machine Learning |
11:00-12:00 | Breakout | (7) Academic and Industrial Scope of OpenML and Probabl in AI - Collaboration |
12:00-13:00 | Lunch | TranTranZai |
13:00-14:00 | Breakout | (5) Probabl Product Technical Discussion (Camille) |
14:00-17:00 | Coding | Coding |
17:00-18:00 | Breakout | (4) Development Tooling and Workflows |
Bar + Restaurant: Food Society Paris
June 26 - 09:00-13:00
09:00-10:00 | Breakout + coffee/croissants | Joaquin et al.: we have a bit of a delay checking out of our Airbnb but will be there shortly. |
10:00-12:00 | TBD | Open / Ad-hoc |
12:00-13:00 | Wrap-up | |
13:00-14:00 | Lunch + end of the hackathon | Subway in Montparnasse |
Breakout Sessions Ideas
This document contains the preliminary agenda and suggestions for the OpenML <> scikit-learn Hackathon, Paris ’24. Breakout sessions are discussions where we can brainstorm or exchange experiences on specific topics. Feel free to propose additional sessions.
💡 Feel free to add new session topics below, there is a template at the end.
1. Organizing Community Events and Onboarding Contributors [Day 1]
leader: Maren Westermann
description:
- Share our experiences organizing hackathons. How do you attract attendees? How do you make sure that the work at a hackathon is fruitful? Where should you organize your hackathons, and how are they funded? Are online open-source sprints/hackathons an option for you?
- What process and documentation should be in place to help onboard new contributors? How to get them started effectively, and how do you make sure they stay with the project?
- How do we get our projects known to users?
notes:
- community sprints: everyone is invited to take part, in particular newcomers, to contribute to an open-source project
- Have a list of curated issues
  - Start with documentation issues for new contributors
  - Come up with issues before the sprint; first-time contributors need a curated list (especially beginner-friendly ones)
  - meta-issues for a group of related issues: once one is fixed, it can serve as a contribution template
  - documentation issues: people start by reading the documentation around the issues
- documentation about how to contribute was lacking and too long
- rewrote the contributors’ guide to be more concise and more beginner-friendly, plus some video tutorials to get started with GitHub-based contributions
- keeping beginner issues reserved for the sprints (so other people don’t jump on them beforehand)
- differences between OpenML hackathons and scikit-learn sprints:
  - OpenML hackathons are one week long and bigger
  - Core-developer sprints vs. new-contributor sprints
  - OpenML has 7 repositories with different programming languages (backend, APIs, frontend, …)
  - Some documentation for first-time contributors, but not exhaustive
  - One-to-one mentoring to get started with a working dev setup
    - Typically requires several days’ investment
- onboarding new contributors online and in person are two different processes
  - Hard to make people feel connected and stay long-term
- The social aspect is important: organizing recurrent events (every few months) to develop more long-term engagement
- Joint PyLadies Paris / scikit-learn core contributor events:
  - near one-to-one mentoring
  - recurrent, every few months
  - a few hours in the evening
  - retention is low, but the events allowed people to develop social bonds
  - Important to have maintainers present to slowly build a connection.
- Retention is low, so be realistic about expectations, but the outliers are what matters
- Personal connection: people are more likely to contribute if they know you
- OpenML hackathons are useful for core maintainers to secure several solid hours in a row to contribute to the project.
  - 0 full-time contributors
  - part-time engineers for academic projects
  - nice locations thanks to EU funding
- Can OpenML use students to help contribute features?
  - Hard to get high-quality submissions
- How to incentivize people?
  - Career building: showing off a scikit-learn contribution on a resume (less long-term contribution)
  - Sense of community
  - Useful for your own research
  - Hard to have ‘flashy results’ (e.g. genAI apps); how can we solve that?
- How to scale the time investment?
  - PyLadies sprints: only a few hours in the evening (6-9pm)
  - Every 2 months, 15-30 (capped) people show up
- How are sprints structured?
  - Pre-sprint: online, so people have the right setup
  - At least one organizer (e.g. Maren for PyLadies, supported by core devs)
  - Shortlist of issues for each sprint
- Paid internships: a great way to find good people and build
  - Requires funding
  - Mentoring takes time, but helps people take over some tasks
- Slack discussions
- Generative AI? Bad quality, wastes time. scikit-learn only allows human contributions.
- How to focus attention? E.g. a key project this quarter?
2. Governance, Funding and Sponsorship
leader: Adrin + François
description: What are our experiences with our governance structures? What are opportunities for open-source projects to make money to pay for, e.g., server costs, organizing events, and so on? How do we argue the importance of our projects to motivate a funder/sponsor? Can we quantify our contribution?
notes:
- Governance is a living document. It matters for the community.
- How to combine an open-source library with a for-profit company:
  - Be clear about which parts are community-“owned” and which parts are company-owned
  - Keep discussions of the community aspects on the community Slack. Decision processes must remain open.
  - write public versions of important decisions
  - Still creates confusion (for users and contributors)
  - Keep people informed through mailing lists, monthly meetings, etc. (This takes a long time, but is worth it in the long run)
  - communicate upcoming discussions on the mailing list
- contributing to scikit-learn opens doors to work at companies
- Inria foundation:
  - 50k EUR for 2 meetings yearly where the company can state priorities
  - No obligation; taken up only if useful for the community
  - Is it sustainable? For academic salaries. As long as scikit-learn stays useful, companies will keep doing this. Requires that someone at the sponsoring company cares that scikit-learn doesn’t decline.
  - Also advertising (logo on the website)
  - Modelled on the Linux Foundation.
- Probabl:
  - Will put an RE (research engineer) to work on a certain issue, but for a lot more money
  - Companies prefer this (they don’t know how to hire a good scikit-learn engineer)
- Leadership by effort:
  - Put your own time into the aspects that are important to you
  - NVIDIA: have people at other companies work on scikit-learn
  - How much effort?
- For projects that need faster cycles, create a separate package (e.g. skops, hazardous, …), but this creates maintenance work.
- Hard to get new reviews in since the process takes so long. Multiple rounds of reviews even for a simple spelling error.
- Library for putting scikit-learn models into production → skops
- Having a documentation lead, community interaction lead, etc.: does it help, and how much?
  - Lowers the bar (core dev is a really high bar)
  - Speeds up decisions
  - Need more people on the different teams
3. Future Collaboration between Scikit-learn and OpenML [Day 1, 2 sessions]
leader: Guillaume Lemaitre
description: scikit-learn can fetch datasets from OpenML, and users can automatically evaluate scikit-learn models on OpenML tasks. What other future collaborations are interesting to explore?
- fetch_openml / download_openml improvements (parquet?)
- dataset upload via parquet → coming (not yet fully supported by the Python API)
- Croissant integration in the scikit-learn OpenML data fetcher? → Tuesday morning
- provenance tracking and reproducibility for OpenML datasets (make it standard to provide a script, possibly hosted externally on GitHub or similar, to show how to reconstruct the OpenML-hosted parquet file from the original dataset format/location)
- collaborative feedback (per-dataset issue tracker) to report and discuss dataset-related problems with the dataset owner/uploader
- If fetch_openml fails, what to do?
- add support for benchmarking? E.g. benchopt
notes:
- fetch_openml
  - The ARFF parser is a headache
  - Logically, it makes sense to first download the data file locally and then load it with pandas or polars (like pandas, but in Rust)
  - scikit-learn does not load parquet right now
  - Sparse datasets are still an issue (not supported by parquet)
    - Make all sparse datasets dense and store them in parquet (they will still compress nicely). Most sparse datasets aren’t that large.
    - Some of these datasets may be one-hot-encoded datasets
  - Pyarrow is the best-supported engine (fastparquet is not). Polars can read parquet natively. Pyarrow does not support pyodide.
- Have an explanation for differences between versions
- Versions discussion
  - Versions of datasets on OpenML are very confusing
  - A version is not related to the lineage of the dataset, which is very confusing to users
  - Versions of datasets are not searchable
  - Use case: go on OpenML, search for a dataset, “which version of the dataset should I select?”
- Benchmarking
  - Interesting for probabl.ai to share models and benchmarks on OpenML?
- scikit-learn pipeline representation
  - Use the HTML widget for scikit-learn pipeline diagrams
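The download-then-load flow discussed above could be sketched roughly as follows; the cache-path helper and the file-naming scheme are hypothetical, and `pd.read_parquet` with `engine="pyarrow"` assumes pyarrow is installed:

```python
# Sketch of "download once, then load locally with pandas" for OpenML
# parquet files. The cache layout below is an illustrative assumption,
# not OpenML's actual on-disk format.
from pathlib import Path


def local_parquet_path(data_id: int, cache_dir: str = "~/.openml/cache") -> Path:
    """Hypothetical helper: where a downloaded OpenML parquet file would live."""
    return Path(cache_dir).expanduser() / f"dataset_{data_id}.parquet"


def load_dataset(data_id: int):
    """Load an OpenML dataset, preferring a locally cached parquet file."""
    path = local_parquet_path(data_id)
    if path.exists():
        import pandas as pd  # pandas delegates parquet reading to pyarrow
        return pd.read_parquet(path, engine="pyarrow")
    # Fall back to scikit-learn's fetcher, which downloads and parses for us.
    from sklearn.datasets import fetch_openml
    return fetch_openml(data_id=data_id, as_frame=True).frame
```

Polars users could swap the pandas call for `polars.read_parquet(path)`; the point is that parsing happens against a local file, not inside the fetcher.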
todo:
- openml: check whether all parquet files can be read with polars and pandas
- openml: convert all sparse datasets to dense and store them in parquet
- openml: have an explanation for differences between versions. When people upload a new version of a dataset, ask for an explanation.
- openml: sort datasets by the quality of the datasheet. Show user/datasetname/id as the name in the web UI; remove/rename “version”.
- openml website: implement a way to open an issue to contact the dataset owner
- openml: the datasheet has a section on preprocessing where people can point to a GitHub link with preprocessing code; encourage users to do this (e.g. via a dataset quality score) and allow people to report problems
- sklearn: try to load parquet files from OpenML in fetch_openml
- openml: visualization of scikit-learn pipelines (flows)
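For the pipeline-visualization item, scikit-learn already ships an HTML representation of estimators that a flow page could embed; a minimal sketch:

```python
# scikit-learn's built-in HTML diagram for estimators. The returned
# string is a self-contained HTML + CSS snippet that can be embedded
# in a web page, e.g. an OpenML flow page.
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import estimator_html_repr

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
html = estimator_html_repr(pipe)
```

The same widget is what Jupyter shows when displaying an estimator with `sklearn.set_config(display="diagram")`.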
3.5 Croissant Talk
notes:
- Croissant is a metadata description format
- ML datasets are a combination of structured and unstructured data, which makes them complicated to manage
- Croissant is built on top of schema.org and adds more detail relative to it
- The format has 4 layers:
  - dataset-level metadata
  - resource description
  - content structure
  - ML semantics
- Croissant does not require any changes to the underlying data
- Analysis and visualization tools work out of the box for all datasets
- Using Croissant, datasets can be exposed consistently across platforms
- Collaborations with Google, Hugging Face, and Google Dataset Search also exist
- OpenML has deeper dataset descriptions by default; HF and Kaggle slightly less so
- Once loaded, datasets can easily be imported elsewhere (torch, tf, etc.)
- Croissant editor: a web app where you can use a GUI to enter the dataset descriptions
- NeurIPS also now recommends using the Croissant format
- Supports the core RAI (Responsible AI) vocabulary
- For images/other files, it points to the path
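As an illustration of the four layers, a Croissant record can be sketched as a JSON-LD-shaped dict; the property names below follow the Croissant spec loosely and should be checked against it before use:

```python
# Illustrative Croissant-style metadata record showing the four layers.
# Property names and prefixes ("sc:", "cr:") are approximations of the
# spec, not a validated document.
croissant = {
    # Layer 1: dataset-level metadata
    "@type": "sc:Dataset",
    "name": "iris",
    "license": "CC-BY-4.0",
    # Layer 2: resource description (the underlying files, unchanged)
    "distribution": [
        {
            "@type": "cr:FileObject",
            "name": "iris.parquet",
            "encodingFormat": "application/x-parquet",
        }
    ],
    # Layer 3: content structure (records and typed fields)
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "name": "records",
            "field": [{"name": "sepal_length", "dataType": "sc:Float"}],
        }
    ],
    # Layer 4: ML semantics (splits, labels, etc.) would be added here
}
```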
todo:
- integer precision and more detailed dtypes
- How are uploaded files linked to each other?
- Lineage of datasets
4. Development Tooling and Workflows
leader: Pieter
description: Automation is important to create more sustainable workloads and generally improves overall project quality. What tooling and workflows are employed in your projects to run tests, ensure code quality, help contributors, and so on? Which do you find most useful? Are there decisions you have come to regret? What are your major pain points?
What are our responsibilities as open-source projects? Should we be embracing platforms such as CodeBerg/Forgejo more?
Notes:
- Switch to open-source tools like CodeBerg once it offers more conveniences
- There is a GitHub maintainer org that you can apply to (if you are a maintainer of an important enough package) that can give you more direct access to GitHub devs/projects.
- The use of Azure workflows in scikit-learn is largely historical, but it also provides a spread over different (free) usage limits
  - GPU actions with a limited budget
- GitHub Actions workflow problems:
  - testing is a pain and not really supported (easier on Azure)
  - bad-ish documentation, but better than Azure’s
- Run the documentation examples that are linked to changes in the PR diff; use CircleCI because it can easily render the generated HTML in the browser (as opposed to GitHub, where you have to download the artifact)
- A bot that posts linter errors as a comment helped a lot
- Aiming for 100% code coverage, including all validation, though that is centralized. Disable coverage for deliberately untested parts of the code. Also test errors, types, and warnings.
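The coverage approach above could be encoded in a coverage.py configuration along these lines (a sketch, not scikit-learn’s actual configuration):

```ini
[run]
branch = True

[report]
# fail CI if overall coverage drops below the target
fail_under = 100
# patterns for lines deliberately excluded from coverage
exclude_lines =
    pragma: no cover
    raise NotImplementedError
```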
5. Probabl Product Technical Discussion
leader: Camille Troillard
description: Presentation/discussion of the Probabl technical product and potential collaboration.
- What to do to put scikit-learn in production, to make it commercially viable
- Help data scientists do better ML
  - Better understand their model’s behavior
- Build something that we’re proud of (and we’re picky)
- Let people do what they do (don’t interfere), but show interesting things along the way
- You should not require a platform, but it should be very easy to switch to a platform
- Educate people. E.g. “you’re changing the metric, but this metric doesn’t make sense.”
- Interactive dashboard that shows the results of your experiment
  - code and outputs side by side
  - Like Weights & Biases, but runs locally
  - Outputs data in a portable DB; results are registered and predictions are shown when they become available
- Unified API to the whole infrastructure stack (like Metaflow)
- Button to ‘push to production’ or ‘push to OpenML’, depending on the user
- “We’ve been spoonfed microservices in order to become addicted to CSPs”
Feedback for OpenML
- Have a clear tagline, e.g. ‘Frictionless ML resources’
- Better search interface
- Nice visualizations for the run page
- Fast website
6. Collaboration Ecosystem for Open-Source Machine Learning
leader: Lennart Purucker
description: What other open-source frameworks are struggling with the same questions we are struggling with? Should we reach out to them? Is there a need for a collaboration ecosystem in open-source machine learning/AI? What are lessons learned from which others might benefit? What are lessons learned from others from which we might benefit?
notes:
- Struggles for open source
  - Copying and learning from scikit-learn projects
    - Bots, CI/CD, CI logic
    - Helps with CI/CD setup; provide more documentation on how to set up an open-source project
  - The main issue for open source is human traffic
    - opening issues, PRs, …
- What other open-source frameworks are struggling with the same questions we are struggling with?
  - https://learn.scientific-python.org/contributors/setup/ecosystem/
  - ML backbone
    - scikit-learn, PyTorch, TensorFlow, mlr, MLJ
    - XGBoost, LightGBM, CatBoost
    - OpenML, pandas, NumPy, SciPy, Polars
  - Python backbone
    - pip / PyPI, conda, uv
    - Ray, joblib
  - ML applications / AutoML / …
    - AMLTK, auto-sklearn, FLAML, AutoGluon, H2O, …
- Should we reach out to them?
  - Company-driven open source vs. community-driven open source
    - Company-driven examples: TensorFlow, (PyTorch)
      - Internal CI vs. open-source CI
    - Community-driven: scikit-learn
      - via GitHub
- Is there a need for a collaboration ecosystem in open-source machine learning/AI?
  - Only if we have problems; otherwise it is unnecessary overhead.
  - Are they my dependencies, or am I their dependency?
- What are lessons learned from which others might benefit? / What are lessons learned from others from which we might benefit?
  - Mostly the governance documents
  - Document CI
  - Only start testing/maintaining other environments on request / when issues arise
  - https://scikit-learn.org/stable/developers/minimal_reproducer.html
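In the spirit of the linked minimal-reproducer guide: synthetic data, a fixed seed, and only the code needed to show the behaviour being reported (the estimator here is just a placeholder):

```python
# Minimal, self-contained reproducer template: small synthetic data,
# fixed random seed, and nothing beyond the code needed to trigger
# the behaviour being reported.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(20, 3))
y = rng.randint(0, 2, size=20)

model = LogisticRegression().fit(X, y)
score = model.score(X, y)
print(score)  # report the observed output alongside the expected one
```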
7. Academic and Industrial Scope of OpenML and Probabl in AI
leader: Lennart Purucker
Also joining: Yann Lechelle
description: Where do we see ourselves in the general field of AI/ML? Do we only cover tabular data? Are we connected to GenAI, computer vision, and NLP? What is our connection to industry applications? How do we effectively explain our position to stakeholders (who read too much about GenAI)?
- Input modalities
  - Tabular (OpenML, scikit-learn)
  - Time series
  - Vision (OpenML soon)
  - NLP
  - Graphs
  - (Other)
- Output modalities / tasks
  - Scalar regression (OpenML, scikit-learn)
  - Quantile regression (OpenML, scikit-learn)
  - Multiclass classification (OpenML, scikit-learn)
  - No-target / unsupervised / data insights (OpenML, scikit-learn)
  - Survival analysis (scikit-learn)
  - Forecasting
  - Anomaly classification
  - Anomaly detection (OpenML, scikit-learn)
  - Generative AI: structured predictions
- ML techniques in AI/ML
  - Traditional ML algorithms (SVM, RF, boosting) (OpenML, scikit-learn)
  - Traditional deep learning (OpenML)
  - Large foundation models
- What do stakeholders understand?
  - Time series
  - GenAI
Notes:
- scikit-learn’s limitation is the API definition
- probabl: “own your data science”
  - Broader scope; may include other things besides scikit-learn
    - Also deep learning and large foundation models
  - The scope is wide around open-source technology
- Can we connect OpenML to Probabl’s scope?
  - “Exporting” the API?
- LLMs “do” UX for ML
Aggregated To-Dos
OpenML
- check whether all parquet files can be read with polars and pandas
- convert all sparse datasets to dense and store them in parquet
- have an explanation for differences between versions. When people upload a new version of a dataset, ask for an explanation.
- sort datasets by the quality of the datasheet. Show user/datasetname/id as the name in the web UI; remove/rename “version”
- website: implement a way to open an issue to contact the dataset owner
- the datasheet has a section on preprocessing where people can point to a GitHub link with preprocessing code; encourage users to do this (e.g. via a dataset quality score) and allow people to report problems
- visualization of scikit-learn pipelines (flows)
- data quality plots
- Does the UX of OpenML need work?
probabl
- try to load parquet files from OpenML in fetch_openml
Croissant
- integer precision and more detailed dtypes
- How are uploaded files linked to each other?